168 research outputs found

    Building a coreference-annotated corpus from the domain of biochemistry


    TIMELINE: Exhaustive Annotation of Temporal Relations Supporting the Automatic Ordering of Events in News Articles

    Temporal relation extraction models have thus far been hindered by a number of issues in existing temporal relation-annotated news datasets, including: (1) low inter-annotator agreement due to the lack of specificity of their annotation guidelines in terms of what counts as a temporal relation; (2) the exclusion of long-distance relations within a given document (those spanning across different paragraphs); and (3) the exclusion of events that are not centred on verbs. This paper aims to alleviate these issues by presenting a new annotation scheme that clearly defines the criteria based on which temporal relations should be annotated. Additionally, the scheme includes events even if they are not expressed as verbs (e.g., nominalised events). Furthermore, we propose a method for annotating all temporal relations -- including long-distance ones -- which automates the process, hence reducing time and manual effort on the part of annotators. The result is a new dataset, the TIMELINE corpus, in which improved inter-annotator agreement was obtained, in comparison with previously reported temporal relation datasets. We report the results of training and evaluating baseline temporal relation extraction models on the new corpus, and compare them with results obtained on the widely used MATRES corpus.
    Comment: Accepted for publication in EMNLP 2023: 13 pages, 3 figures and 14 tables

    Learning to Play Chess from Textbooks (LEAP): a Corpus for Evaluating Chess Moves based on Sentiment Analysis

    Learning chess strategies has been investigated widely, with most studies focussing on learning from previous games using search algorithms. Chess textbooks encapsulate grandmaster knowledge, explain playing strategies and require a smaller search space compared to traditional chess agents. This paper examines chess textbooks as a new knowledge source for enabling machines to learn how to play chess -- a resource that has not been explored previously. We developed the LEAP corpus, the first heterogeneous dataset with structured (chess move notations and board states) and unstructured data (textual descriptions) collected from a chess textbook containing 1164 sentences discussing strategic moves from 91 games. We first labelled the sentences based on their relevance, i.e., whether they are discussing a move. Each relevant sentence was then labelled according to its sentiment towards the described move. We performed empirical experiments that assess the performance of various transformer-based baseline models for sentiment analysis. Our results demonstrate the feasibility of employing transformer-based sentiment analysis models for evaluating chess moves, with the best performing model obtaining a weighted micro F_1 score of 68%. Finally, we synthesised the LEAP corpus to create a larger dataset, which can be used as a solution to the limited textual resources in the chess domain.
    Comment: 27 pages, 10 figures, 9 tables

    Semantics Altering Modifications for Evaluating Comprehension in Machine Reading

    Advances in NLP have yielded impressive results for the task of machine reading comprehension (MRC), with approaches having been reported to achieve performance comparable to that of humans. In this paper, we investigate whether state-of-the-art MRC models are able to correctly process Semantics Altering Modifications (SAM): linguistically-motivated phenomena that alter the semantics of a sentence while preserving most of its lexical surface form. We present a method to automatically generate and align challenge sets featuring original and altered examples. We further propose a novel evaluation methodology to correctly assess the capability of MRC systems to process these examples independent of the data they were optimised on, by discounting for effects introduced by domain shift. In a large-scale empirical study, we apply the methodology in order to evaluate extractive MRC models with regard to their capability to correctly process SAM-enriched data. We comprehensively cover 12 different state-of-the-art neural architecture configurations and four training datasets and find that -- despite their well-known remarkable performance -- optimised models consistently struggle to correctly process semantically altered data.
    Comment: AAAI 2021, final version. 7 pages content + 2 pages references

    Towards End-User Development for IoT: A Case Study on Semantic Parsing of Cooking Recipes for Programming Kitchen Devices

    Semantic parsing of user-generated instructional text, as a way of enabling end-users to program the Internet of Things (IoT), is an underexplored area. In this study, we provide a unique annotated corpus which aims to support the transformation of cooking recipe instructions to machine-understandable commands for IoT devices in the kitchen. Each of these commands is a tuple capturing the semantics of an instruction involving a kitchen device in terms of "What", "Where", "Why" and "How". Based on this corpus, we developed machine learning-based sequence labelling methods, namely conditional random fields (CRF) and a neural network model, in order to parse recipe instructions and extract our tuples of interest from them. Our results show that while it is feasible to train semantic parsers based on our annotations, most natural-language instructions are incomplete, and thus transforming them into a formal meaning representation is not straightforward.
    Comment: 8 pages, 1 figure, 2 tables. Work completed in January 202

    Extracting granular information on habitats and reproductive conditions of Dipterocarps through pattern-based literature analysis

    Lowland tropical rainforests in Southeast Asia, composed primarily of dipterocarp species, are among the most threatened ecosystems in the world. Belonging to the family Dipterocarpaceae, dipterocarps are economically and ecologically important due to their timber value as well as their contribution to wildlife habitat. The challenge in the restoration and rehabilitation of these dipterocarp forests lies in their complex reproduction patterns, i.e., supra-annual mass flowering events that may occur at irregular intervals of two to ten years, possibly synchronously across Asia. Understanding their regeneration to make plans for effective reforestation can be aided by providing access to a comprehensive database that contains long-term and wide-scale data on dipterocarps. The content of such a database can be enriched with literature-derived information on habitats and reproductive conditions of dipterocarps. We aim to develop literature mining methods to automatically extract information relevant to the distribution and reproductive cycle of dipterocarps, in order to help predict the likelihood of their regeneration, and subsequently make informed decisions regarding species for reforestation. In previous work, we developed a machine learning-based named entity recognition (NER) model that automatically annotates entities relevant to species’ distribution, e.g., taxon names, geographic locations, temporal expressions, habitats, authorities, and names of herbaria. Furthermore, the species’ reproductive condition, e.g., whether it is sterile or in the state of producing fruit ("in fruit") or flower ("in flower"), was also automatically annotated to enable the derivation of phenological patterns. The model was trained on a manually annotated corpus of documents, e.g., scholarly articles and government agency reports.
    In this work, we focus our efforts specifically on the extraction of relationships between habitats and their locations, and between reproductive conditions and temporal expressions. To this end, we have developed a syntactic pattern-based matching approach by building upon Grew (http://grew.fr/), a graph rewriting system for manipulating linguistic representations. For our purposes, patterns that made use of syntactic dependencies, part-of-speech tags and named entity types (derived from NER results) were designed. When fed into Grew, these patterns were able to analyse sentences in scholarly articles by associating habitats with their geographic locations, and by determining a species’ reproductive condition at a specific point in time. The resulting relationships are then used to enrich information contained in a database of dipterocarp occurrences. Such a resource will provide more comprehensive ecological data that could form the basis of more informed reforestation decisions.